Sample covariance between variables \(x=x_t\) and \(y=y_t\):
\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]
Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):
\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]
Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):
\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]
The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]
\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.
\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.
“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012
Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:
All have similar statistical properties.
| set | mean of x | mean of y | variance of x | variance of y | correlation | intercept | slope |
|---|---|---|---|---|---|---|---|
| 1 | 9 | 7.5 | 11 | 4.127 | 0.82 | 3 | 0.5 |
| 2 | 9 | 7.5 | 11 | 4.128 | 0.82 | 3 | 0.5 |
| 3 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
| 4 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
Illustration of lag for 20.07.2013 in Lausanne:
Could a similar relationship be confirmed by examining correlations between the daily maximum values of radiation and ozone?
A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:
|
\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \] |
\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \] |
“Correlation does not imply causation”:
library(dplyr)
library(reshape2)
library(chron)
library(ggplot2)
source("GRB001.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
Let us load data saved from Lesson 4.
df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")
Before computing correlations, we will first visualize the relationships using scatter plots (or x-y plots).
Let us calculate the daily maximum values for each variable:
lf <- melt(df, measure.vars=variables)
dm <- lf %>% group_by(site, year, month, day, season, variable) %>%
summarize(value=max(value, na.rm=TRUE))
daily.max <- dcast(dm, site + year + month + day + season ~ variable)
rm(dm) # no longer needed
Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.
ggp <- ggplot(daily.max)+
facet_grid(site~season)+
geom_point(aes(TEMP,O3))
print(ggp)
The corresponding correlation values can be obtained with the built-in function, cor (see `?cor).
(cor.values <- daily.max %>% group_by(site, season) %>%
summarize(correlation=cor(TEMP, O3)))
## Source: local data frame [8 x 3]
## Groups: site [?]
##
## site season correlation
## (chr) (fctr) (dbl)
## 1 LAU DJF 0.08360881
## 2 LAU MAM 0.52239691
## 3 LAU JJA 0.80040012
## 4 LAU SON 0.52610401
## 5 ZUE DJF 0.20129372
## 6 ZUE MAM 0.62341179
## 7 ZUE JJA 0.84064503
## 8 ZUE SON 0.65368411
We can format this table and save to file:
(cor.values.wf <- dcast(cor.values, season~site, value.var="correlation"))
## season LAU ZUE
## 1 DJF 0.08360881 0.2012937
## 2 MAM 0.52239691 0.6234118
## 3 JJA 0.80040012 0.8406450
## 4 SON 0.52610401 0.6536841
write.csv2(cor.values.wf, "correlations.csv", row.names=FALSE)
write.csv2 writes to a CSV (comma separated value) file using the European convention of defining delimiters as semicolons (;) rather than commas (,).
Or, we can visualize the values:
ggp <- ggplot(cor.values)+
geom_bar(aes(season, correlation), stat="identity")+
scale_y_continuous(limits=c(0,1))+
facet_grid(.~site)
print(ggp)
We can examine the correlation of other pollutants or meterological variables with O3.
lf <- melt(daily.max, measure.vars=setdiff(variables, "O3")) # everything but ozone
ggp <- ggplot(lf)+
facet_grid(site~variable, scale="free_x")+
geom_point(aes(value, O3, group=season, color=season), shape=4)
print(ggp)
We will next look at a scatter plot between hourly measurements of CO and NO2. Why does this relationship exist?
ggp <- ggplot(df)+
facet_grid(site~season)+
geom_point(aes(CO, NO2))
print(ggp)
For the following scatterplot matrix, we will use a package called lattice which is a plotting system that exists in parallel to R’s base graphics and ggplot2 package (confusing, I know). We additionally define a function to include correlation coefficients in our panels.
library(lattice)
CorrelationValue <- function(x, y, ...) {
correlation <- cor(x, y,use="pairwise")
if(!is.na(correlation)) {
cpl <- current.panel.limits()
panel.text(mean(cpl$xlim),mean(cpl$ylim),
bquote(italic(r)==.(sprintf("%.2f",correlation))),
adj=c(0.5,0.5),col="blue")
}
}
Plot daily maximum values as pairwise points (only Lausanne) using this syntax:
ix <- grepl("LAU", daily.max[["site"]], fixed=TRUE)
spp <- splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
upper.panel = CorrelationValue,
pch=4)
print(spp)
Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.
ix <- grepl("LAU", df[["site"]], fixed=TRUE)
spp <- splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
upper.panel = CorrelationValue,
panel = panel.smoothScatter,
pch=4)
print(spp)
Notice some values of correlation went up.
We can define a function to generate lagged pairs for \(k\) intervals and apply it using the do() function. If we provide hourly measurements, \(k=1\) corresponds to a lag of 1 hour, and so on.
Lag <- function(pair, k) {
out <- data.frame(lag=k, head(pair[,1],-k), tail(pair[,2],-k))
names(out)[2:3] <- colnames(pair)
out
}
lagged <- df %>%
group_by(site, season) %>%
do(rbind(Lag(.[,c("RAD","O3")], 1),
Lag(.[,c("RAD","O3")], 2),
Lag(.[,c("RAD","O3")], 3),
Lag(.[,c("RAD","O3")], 4),
Lag(.[,c("RAD","O3")], 5),
Lag(.[,c("RAD","O3")], 6)))
ggp <- ggplot(lagged) +
geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
facet_grid(lag~season)
print(ggp)
R includes a function, ccf, for calculating the cross correlation. The value passed to the na.action=na.pass says to compute covariances from the complete cases when there are missing values. Calling this function will generate a corresponding plot by default (which can be turned off). See ?ccf for further details.
ccf(df[,"RAD"], df[,"O3"], lag.max=6, na.action=na.pass)
This tells us that the correlation is higest at a lag of -3 hours.
We can wrap this in a function that returns information that we want in a data frame (note plot=FALSE argument):
LaggedCorrelation <- function(pair, ...) {
out <- ccf(pair[,1], pair[,2], ..., na.action=na.pass, plot=FALSE)
data.frame(lag=out[["lag"]], value=out[["acf"]])
}
lagged.values <- df %>% group_by(season) %>%
do(LaggedCorrelation(.[,c("RAD","O3")], lag.max=6))
Note the syntax, ... in the function definition This passes on any additional arguments (lag.max in this case) from the outer function (LaggedCorrelation) to the inner-most function (ccf). ?... will refer you to the Introduction to R manual for further details on this syntax.
Plot:
ggp <- ggplot(lagged.values)+
geom_segment(aes(x=lag,xend=lag,y=0,yend=value))+
facet_grid(.~season)+
xlab("lag (hours)")+
ylab("Cross correlation coefficient")
print(ggp)